1 Introduction to the dataset used

The original dataset contains RNA-seq count data generated by profiling peripheral blood from 62 COVID-19 patients (COVID-19 group) and 24 healthy controls (HC group) via bulk RNA-seq. It is publicly available on Gene Expression Omnibus database (GEO) under accession number GSE152641[1] as a count matrix with 20460 unique Entrez gene ID’s along the rows and 86 samples (biological replicates) along the columns. The original study not only performed expression analysis on genes of these 86 samples, but also compiled expression data of prior studies on six viruses: influenza, RSV, HRV, Ebola, Dengue, and SARS in an attempt to isolate COVID-19 biomarkers from other viral infections. Here we will only perform an expression analysis for the RNA-seq count matrix provided by this study. The count matrix file was downloaded using GEOquery[2] version 2.55.1. TMM normalisation was applied to the data using edgeR[3][4] version 3.30.0 before exploratory data analysis. During the normalisation process, 6035 genes had expression values that met the criterium of being outliers as per the edgeR protocol[5] and hence were removed. The mapping from Entrez ID against the official gene symbol (HUGO/HGNC) of the genes was done using annotate[6] and database org.Hs.eg.db[7]. The mapping resulted in one duplicate record of gene ZFHX3, to which the gene of reference ID 388289 and gene 463 were both mapped:

Entrez ID Gene Symbol Status
388289 C16orf47 Replaced with Gene ID: 463, HUGO symbol: ZFHX3

Since the expression values for these two ID’s are identical, and ID 388289 is obsolete, I decided to remove the record for ID 388289 from the normalised expression matrix; retaining it could introduce noises in further analysis.

For the exploratory analysis, edgeR[3][4] was also used to produce a multidimensional scaling (MDS) plot, a Biological Coefficient of Variation (BCV) plot, and a mean-variance plot to present evidence of differences between our biological replicates of the two groups (HC and COVID-19).

The MDS plot presents variation amongst samples based on the normalised gene expression. The distances between each pair of samples on the MDS plot suggest their leading logFC. The healthy control (HC) and the patients (COVID-19) samples form separate clusters along the plot dimension 1. There are however still several COVID-19 samples mixed in the HC cluster.

The BCV plot presents the estimated relative variability of expression between biological replicates; it illustrates the association between the biological CV and the average true gene abundance. The common dispersion line suggests all gene expression values vary by close to a BCV value of 0.5 amongst replicates.[8] The tagwise dispersion shows BCV values calculated individually for each gene. We observe that genes with higher true abundance (under the assumption that RNA-seq counts follow a negative binomial distribution) have lower BCV’s than genes with lower abundance. This imples that the higher the true expression level of a gene has, the lower its variation in our samples.

The mean-variance plot presents the modelling of the mean-variance relationship for the normalised expression values, which are split into bins by overall expression level in the model. The grey points represent the pooled genewise variances. The blue points on the mean-variance plot represent the estimated genewise variances. The red crosses represent the pooled genewise variances, while the dark red crosses represent the the average of the pooled variances for each bin of genes plotted against the average expression level of the genes in the bin. The blue line shows the mean-variance relationship for a negative binomial model with common dispersion.[3][4] We obeserve that all types of variances fit well along the negative binomial line.

Note that this result can only be reproduced using Bioconductor 3.11; using other versions of Bioconductor will lead to a different mapping of gene names.

2 Differential Gene Expression

2.1 Visualise with Heatmap: First Attempt

We will be using ComplexHeatmap[9] for heatmap visualisation of gene expression and circlize[10] to generate a colour gredient indicating the expression levels of genes.

Let’s try to plot out all the genes in a heatmap to see if there is any noticible pattern:

We observed no signs of differential expression.

2.2 Differential Expression Analysis by Quasi-Likelihood Methods (QLM)

2.2.1 Model Design

We hypothesized that COVID-19 status of samples was the only factor contributing to differential gene expression. Therefore, we model on group (status COVID-19 or HC) such that fitting this design matrix will tell us how the COVID-19 status of a sample explains his/her expression levels of genes.

(model_design <- model.matrix(~group, data = samples_by_group))
##    (Intercept) groupHC
## 1            1       1
## 2            1       1
## 3            1       1
## 4            1       1
## 5            1       1
## 6            1       1
## 7            1       1
## 8            1       1
## 9            1       1
## 10           1       1
## 11           1       1
## 12           1       1
## 13           1       1
## 14           1       1
## 15           1       1
## 16           1       1
## 17           1       1
## 18           1       1
## 19           1       0
## 20           1       0
## 21           1       0
## 22           1       0
## 23           1       0
## 24           1       0
## 25           1       0
## 26           1       0
## 27           1       0
## 28           1       0
## 29           1       0
## 30           1       0
## 31           1       0
## 32           1       0
## 33           1       0
## 34           1       0
## 35           1       0
## 36           1       0
## 37           1       0
## 38           1       0
## 39           1       0
## 40           1       0
## 41           1       0
## 42           1       0
## 43           1       0
## 44           1       0
## 45           1       0
## 46           1       0
## 47           1       0
## 48           1       0
## 49           1       0
## 50           1       0
## 51           1       0
## 52           1       0
## 53           1       0
## 54           1       0
## 55           1       0
## 56           1       0
## 57           1       0
## 58           1       0
## 59           1       0
## 60           1       0
## 61           1       0
## 62           1       0
## 63           1       0
## 64           1       0
## 65           1       0
## 66           1       0
## 67           1       0
## 68           1       0
## 69           1       0
## 70           1       0
## 71           1       0
## 72           1       0
## 73           1       0
## 74           1       0
## 75           1       0
## 76           1       0
## 77           1       0
## 78           1       0
## 79           1       0
## 80           1       0
## 81           1       1
## 82           1       1
## 83           1       1
## 84           1       1
## 85           1       1
## 86           1       1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Note that samples_by_group is a dataframe labeling samples with their groups:

samples_by_group

2.2.2 Multiple Hypothesis Testing for Differentially Expressed Genes by QLM

We first use edgeR[3][4] to estimate the dispersion. Note that the object d is a DGEList previously created when normalising the count matrix (see the data processing script for details).

d       <- edgeR::estimateDisp(d, model_design) # dispersion of normalised count
fit_qlm <- edgeR::glmQLFit(d, model_design)

In this step, QL dispersion was estimated with edgeR[3][4] by fitting a QL negative binomial (NB) generalised log-linear model to the DGEList data containing the normalised counts and the estimated NB dispersions (visualised with the BCV and mean-variance plots), along with the design matrix we just created.

We then use the fitted result to conduct genewise QL F-test for the coefficients of our defined sample groups.

qlf_SARS2vsHC <- edgeR::glmQLFTest(fit_qlm, coef = colnames(model_design)[2])

This step calculates the p-values for each of the genes in our expression set.

2.2.3 QL F-test P-values Comparison: FDR-correcred vs Uncorrected

Extract the top differentially expressed tags ranked by p-values from the result of the QL F-test and display the number of genes that passed the correction:

Number of genes with p-value < 0.05 Number of genes that pass after correction
8162 7260

The threshold for showing statistically significant evidence for differential expression was set to \(p < 0.05\) for each gene. We chose this threshold because we want to capture the genes that only have less 5% chance to show such differences in expression between groups if they were non-differentially expressed genes.

Note that edgeR by default uses the false discovery rate (FDR/Benjamini-Hochberg) method to correct p-values for false positive results. This method was applied because we need to control for the the liklihood of false positive results that would increase by chance with the increasing number of tests performed. We also set the threshold for the corrected p-value (FDR) to be \(<0.05\), because we want to capture genes that have false positive results for fewer than 5% of the significant tests.

2.2.4 Volcano Plot for QLF Results

A Volcano plot was chosen over an MA plot to show the amount of differentially expressed genes because a volcano plot can answer such questions as how many genes that passed the correction are up-regulated or are down-regulated, while an MA plot fails to visualise the association between the results of multiple hypothesis testing and differential expression. Note that the p-values we used for the volcano plot are the FDR-corrected p-values.

Here we highlighted two genes of interest: ACO1 and ATL3. These two genes were identified by the original study to be the most important COVID-19 signature genes. Thair .et al discovered that ACO1 and ATL3 are completely oppositely regulated between COVID-19 patients and non-COVID-19 patients; ACO1 is up-regulated in COVID-19 patients and down-regulated in patients with non-COVID-19 viral infections, while ATL3 is down-regulated in COVID-19 patients and up-regulated in non-COVID-19 patients. The authors also found that ZC3H13 is the most down-regulated genes amongst the COVID-19 signature genes.[1] The position on our volcano plot reflect that the result of our differential expression analysis by QLM coincides with this observation made by Thair .et al; ACO1 is down-regulated while ATL3 is up-regulated in the healthy control.

2.2.5 Heatmap for QLF top-hit Genes

The heatmap plotted from the top-hit genes (\(FDR < 0.05\)) from the differential expression analysis exhibits clusters of up-regulated and down-regulated expressions between the two group. This was not observed in the heatmap plotted using all the genes because here we have removed genes that show no significant evidence for differential expression for COVID-19 status (\(FDR \geq 0.05\)), which introduced noises in the first heatmap.

2.3 Differential Expression Analysis by Linear Models of MircroArray (Limma)

2.3.1 Multiple Hypothesis Testing

We are fitting the same model matrix as the one we used for QLM fitting:

fit_limma <- limma::lmFit(gene_counts_norm, design = model_design)

Then we apply empircal Bayes to compute differential expression:

fit_limma <- limma::eBayes(
    fit = fit_limma,
    trend = T # specific to RNA-seq, as we are not working with microarray data
)

Get the top hits ranked by P-values:

limma_output_hits <- limma::topTable(
    fit = fit_limma,
    coef = which(colnames(fit_limma$coefficients) == "groupHC"),
    adjust.method = "BH", # use Benjamni-Hochberg to correct for p-values
    number = nrow(gene_counts_norm)
    )
limma_output_hits <- limma_output_hits[order(limma_output_hits$P.Value), ]

2.3.2 P-value adjustment methods

For the results of empircal Bayes, we are able to select whatever p-value correction mathods R offers:

p.adjust.methods
## [1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
## [6] "BY"         "fdr"        "none"

Note that BH and fdr are synonymous; they both refer to the Benjamini-Hochberg procedure for FDR control. According to R documentation for these adjustment methods, holm, hochberg, hommel, and bonferroni are designed to control the family-wise error rate (the probability of making one or more false postive discoveries; Type 1 error) and they control the probability of making such error in the multi-hypotheses testing. BH/fdr and BY on the other hand allow false positive results, but control the proportion out of the significant tests. This means the latter two methods are less conservative and stringent than the first four. Although hommel is a more powerful method, the R documentation states that the difference between the results of these two methods are small while the p-values of BH are faster to calculate. Therefore the FDR method was also chosen here.

2.3.3 Limma P-values Comparison: BH-adjusted vs Unadjusted

Check the number of genes that have P-values below the 0.05 cutoff and genes that have Benjamni-Hochberg corrected P-values below the cutoff:

Number of genes with p-value < 0.05 Number of genes that pass after correction
7997 7015

The result is quite similar to that of the QLM F-test, with slightly fewer genes that passed the threshold.

2.3.4 Volcano Plot for Limma top-hit Genes

The volcano plot behaves oddly for the limma result because of an outlier showing expression of -5000 fold change, and an outlier showing expression of higher than 1000 fold change; this could be attributed to the different methods edgeR and limma used to calculate log2 fold change. Genes cluster around the zero fold change level, and it becomes difficult to tell whether the signature genes are up- or down-regulated here.

2.3.5 Heatmap for Limma top-hit Genes

The heatmap for the limma top-hit genes has the same pattern of clusters as the one for the QLF top-hit genes. This again is because we only retain genes that have are significant for the differential expression analysis by limma-eBayes, and ploted the two groups side-by-side for comparison.

2.4 Comparing QL F-test and Limma Results

The p-values calculated by QLF and Limma methods exhibit a strong positive linear correlation, suggesting the similarity in the results of these two differential expression analyses. Moreover, both these two methods can capture a common subset of top hits; this explains the nearly identical clustering pattern in the two heatmaps.

3 Thresholded Over-Representation Analysis (ORA)

The thresholded ORA was performed using the top-hit differentially expressed genes from the QLF result. This is because we couldn’t confirm whether the signature genes ACO1 and ATL3 are up- or down-regulated in HC from the volcano plot for the limma result and compare against the observation made by Thair .et al. The unusally high and low fold changes of some genes in the limma result also makes me sceptical of its credibility.

3.1 ORA Method

To perform thresholded ORA, we used g:Profiler[11]. The reasons why we choose g:Profiler here is that, first, it provides R interface, while many of other ORA tools are only availabel as web servers; second, compared to the other popular R package clusterProfiler for enrichment analysis, g:Profiler allows users to query multiple annotation sources in one query and combine the results, while we could only query single annotation source one at a time with clusterProfiler. Although clusterProfiler has a set of more versatile plotting functions than g:Profiler R interface does, the out-of-box interactive Manhattan plot combining the results of different annotation sources is a better option for preliminary ORA.

3.2 Annotation Data

Since this is a pathway analysis, we are only interested in data sources for biological pathways. Hence, Reactome, Wikipathways, and KEGG were included. Although sometimes KEGG returns uninformative pathways, we still included it here to get a more comprehensive result. We will also include GO:Biological process (GO:BP) as our data source. Note that a biological process is different from a pathway; however, pathways are considered to collectively participate in biological processes. Therefore, we also included GO:BP as one of our data sources.

The databases used by the current release of g:Profiler are Ensembl 102, which is also the current version of GO database, and Ensembl Genomes 49. The currect release of Reactome used is version 75. Wikipathways is a comunity-maintained database and is being constantly updated. The current release of KEGG used is 97.0.

3.3 Threshold, P-value Correction, and IEA

The p-value threshold for the significance of the gene-set enrichment of the Fisher’s exact test calculated by g:Profiler was set to \(0.05\). The numbers of genesets that were returned with this threshold by each query would be presented as a summary below. We chose Benjamini–Hochberg FDR for p-value adjustment method, and exclude IEA results (Inferred from Electronic Annotation) since we want the human-curated results, which have better quality.

3.4 Thresholded Lists

We need to create 2 thresholded lists: one for the top-hit genes that were up-regulated, and one for those downregulated.

upregulated_genes <- entrez_to_gname[rownames(qlf_output_hits$table[which(
    qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC > 0
),])]
downregulated_genes <- entrez_to_gname[rownames(qlf_output_hits$table[which(
    qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC < 0
),])]

An overview of our top-hit genes by QLF:

Upregulated top-hit genes Downregulated top-hit genes
3260 4000

3.5 All Thresholded Differentially Expressed Genes

3.5.1 Query Summary

Genesets Returned Significant Genesets Ambiguous Genes Failed Genes
14777 865 63 24

Note that the threshold for the query is \(p < 0.05\).

3.5.2 Graphical Summary

3.5.3 Query Result Detail

There are too many pathways to show, and since GO terms are propagated, there might be terms that are too board and not so informative. Hence we limit the term size to fewer than 1000 genes and only look at those that passed p-value threshold and FDR correction:

3.5.3.1 GO Biological Process

3.5.3.2 Reactome

3.5.3.3 WikiPathways

3.5.3.4 KEGG

3.6 Upregulated Genes

3.6.1 Query Summary

Genesets Returned Significant Genesets Ambiguous Genes Failed Genes
11431 460 40 6

3.6.2 Graphical Summary

3.6.3 Query Result Detail

3.6.3.1 GO Biological Process

3.6.3.2 Reactome

3.6.3.3 WikiPathways

3.6.3.4 KEGG

3.7 Downregulated Genes

The workflow is the same as the enrichment analysis for the upregulated genes.

3.7.1 Query Summary

Genesets Returned Significant Genesets Ambiguous Genes Failed Genes
12987 1054 23 18

3.7.2 Graphical Summary

3.7.3 Query Result Detail

3.7.3.1 GO Biological Process

3.7.3.2 Reactome

3.7.3.3 WikiPathways

3.7.3.4 KEGG

3.8 ORA Result Comparison

Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

The number of gene sets returned appears to be proportional to the length of the query gene list. All these three results using three different gene lists reveal different information.

In terms of GO biological processes, the result for all differentially expressed genes involve regulation of neutrophil, leukocyte, and myeloid cells. The result for up-regulated genes does not seem to have many informative GO terms except for viral gene expression and viral transcription. The result for the down-regulated genes has terms involve type I interferon and defense response to virus.

In terms of REACTOME, pathways involving neutrophil regulation and myeloid leukocyte also appear in the result for all differentially expressed genes. The result for the up-regulated genes does not seem to have many informative terms either. The result for the down-regulated genes also shows signaling pathways for interferon.

In terms of wikipathways, the results for all differentially expressed genes shows T-Cell antigen Receptor (TCR) Signaling Pathway and Type II interferon signaling (IFN); pathways involving type I interferon appear too. The result for the up-regulated genes also have pathways involving TCR. The result for the down-regulated genes is similar to the previous two sources, showing pathways involving Type I interferon and Type II interferon.

In terms of KEGG, the result for all differentially expressed genes has “Coronavirus disease - COVID-19” has the top hit pathway, expected but not informative; many cancer-relative pathways also appear. The result for the up-regulated genes shows a pathway involving TCR. The KEGG pathways for the down-regulated genes are mostly cancer-related.

4 Overall Interpretation

Do the over-representation results support conclusions or mechanism discussed in the original paper?

The ORA results highly support the conclusions drawn in the original paper. The authors also found pathways such as neutrophil activation, innate immune response, immune response to viral infection, type-I interferon signaling, and cytokine production for 771 upregulated genes they discovered in COVID-19 patients; pathways they found for 1231 down-regulated genes include lymphocyte differentiation and T-cell activation and regulation, which also appear in our ORA result. The authors concluded that T-cell are suppressed while neutrophils are activated as an indicator of host response to COVID-19 represented in the transcriptomic changes.[1]

Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

There have been prior researches on host immune responses to COVID-19 showing findings that support this enrichment analysis result. An immune analysis on COVID-19 patients by Hadjadj et al. has shown findings that coincide with our ORA result. Their multiplex gene expression analysis showed an up-regulation of genes involved in type I interferon signaling in COVID-19 patients[12], while pathways involving type I interferon are present in our ORA results for down-regulated genes in healthy control. Anurag et al. observed increased total leukocyte count and differential neutrophil count in patients with severe COVID-19 infection[13], and GO terms leukocyte activation involved in immune response and neutrophil activation involved in immune response are present in the ORA result for all differentially expressed genes. Kalfaoglu B. et al. discovered a unique dynamics of T-cells in severe COVID-19 patients: T-cells become hyperactivated, proliferate and die rapidly before differentiating into Treg, while they have shown that the majority of Treg-type genes are regulated by TCR signaling[14]; this supports the involvement of T-cell signaling pathways in our ORA result.

5 Note

5.1 Normalisation Script

For details regarding the processing of the data, see GSE152641_data_processing.R under scripts directory of the repository.

5.2 Helper Functions

For details regarding implementation of helper functions used in this markdown file, see directory utils.

5.3 Ranked Genelist

Although we are performing thresholded ORA, and GESA is not in the scope of this assignment, a ranked gene-list was asked at the beginning of the assignment description and is provided below. A data file will also be created in /data directory once the Rmd file is compiled. Note that it will not be used for the ORA as it is not the appropriate tool for a ranked/un-thresholded gene-list.

gene_rank <- data.frame(
    genename = entrez_to_gname[rownames(qlf_output_hits$table)],
    F_stat   = -log(qlf_output_hits$table$FDR,
                    base = 10) * sign(qlf_output_hits$table$logFC)
)
gene_rank

Note that we define the rank to be \(-\log_{10}{\mathrm{FDR}} \ \cdot\ \operatorname{sgn}(\log{\mathrm{FC}})\). With the \(-\log_{10}{\mathrm{FDR}}\), the samller the (corrected) p-value for a gene is, the higher the rank of that gene, regardless of whether it is up- or downregulated. The other term \(\operatorname{sgn}(\log{\mathrm{FC}})\) take into account the factor of regulation of gene expression: if it is up-regulated, then it will be ranked from the top; if it is down-regulated, then it will be ranked from the bottom. Thereby we have genes that show the most statistically significant evidence of differential expression at the top for those upregulated, and the bottom of the list for those downregulated, while genes in the middle are least significant. Again, gene enrichment analysisfor the non-thresholded ranked list is not in the scope of this assignment and hence will not be run.

References

1. Thair SA, He YD, Hasin-Brumshtein Y, Sakaram S, Pandya R, Toh J, et al. Transcriptomic similarities and differences in host response between sars-cov-2 and other viral infections. iScience. 2021;24:101947. doi:https://doi.org/10.1016/j.isci.2020.101947.

2. Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007;23:1846–7. doi:10.1093/bioinformatics/btm254.

3. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009;26:139–40. doi:10.1093/bioinformatics/btp616.

4. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research. 2012;40:4288–97. doi:10.1093/nar/gks042.

5. Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, et al. Count-based differential expression analysis of rna sequencing data using r and bioconductor. Nature protocols. 2013;8:1765.

6. Gentleman R. Annotate: Annotation for microarrays. R package version. 2017;1:693.

7. Carlson M, Falcon S, Pages H, Li N. Org. Hs. Eg. Db: Genome wide annotation for human. R package version. 2017;3.

8. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research. 2012;40:4288–97. doi:10.1093/nar/gks042.

9. Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32:2847–9. doi:10.1093/bioinformatics/btw313.

10. Gu Z, Gu L, Eils R, Schlesner M, Brors B. circlize implements and enhances circular visualization in R. Bioinformatics. 2014;30:2811–2. doi:10.1093/bioinformatics/btu393.

11. Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research. 2019;47:W191–8. doi:10.1093/nar/gkz369.

12. Hadjadj J, Yatim N, Barnabei L, Corneau A, Boussier J, Smith N, et al. Impaired type i interferon activity and inflammatory responses in severe covid-19 patients. Science. 2020;369:718–24. doi:10.1126/science.abc6027.

13. Anurag A, Jha PK, Kumar A. Differential white blood cell count in the covid-19: A cross-sectional study of 148 patients. Diabetes & Metabolic Syndrome: Clinical Research & Reviews. 2020;14:2099–102. doi:https://doi.org/10.1016/j.dsx.2020.10.029.

14. Kalfaoglu B, Almeida-Santos J, Tye CA, Satou Y, Ono M. T-cell hyperactivation and paralysis in severe covid-19 infection revealed by single-cell analysis. Frontiers in Immunology. 2020;11:2605. doi:10.3389/fimmu.2020.589380.